Enabling large-scale screening of Barrett’s esophagus using weakly supervised deep learning in histopathology

Timely detection of Barrett’s esophagus, the pre-malignant condition of esophageal adenocarcinoma, can improve patient survival rates. The Cytosponge-TFF3 test, a non-endoscopic, minimally invasive procedure, has been used for diagnosing intestinal metaplasia in Barrett’s. However, it depends on a pathologist’s assessment of two slides, one stained with H&E and one with the immunohistochemical biomarker TFF3. This resource-intensive clinical workflow limits large-scale screening in the at-risk population. To improve screening capacity, we propose a deep learning approach for detecting Barrett’s from routinely stained H&E slides. The approach relies solely on diagnostic labels, eliminating the need for expensive localized expert annotations. We train and independently validate our approach on two clinical trial datasets, totaling 1866 patients. We achieve 91.4% and 87.3% AUROCs on discovery and external test datasets for the H&E model, comparable to the TFF3 model. Our proposed semi-automated clinical workflow can reduce pathologists’ workload to 48% without sacrificing diagnostic performance, enabling pathologists to prioritize high-risk cases.



Summary of key study elements
In this section, we summarize the key study elements of our paper, listed according to recent reporting guidelines [9,10] for applications of machine learning (ML) in clinical research. We present the study setup (Supplementary Table 8), model details (Supplementary Table 9), experimental details (Supplementary  1.

Model architecture
ML method and rationale: A weakly supervised multiple instance learning (MIL) method can achieve high diagnostic performance in detecting BE from H&E slides because it aligns closely with the nature of the task: a slide is labeled BE-positive when goblet cells are detected in a small area of the whole-slide image. This is a classical MIL task and resembles the assessment criteria of expert histopathologists. We use a weakly supervised deep learning network architecture inspired by the Transformer-MIL proposed in [8]. The resulting model architecture is called BE-TransMIL. Benchmarked encoders: ResNet18, ResNet50, DenseNet121, Swin-T (see Methods: Model architecture).
Features: Learnable features selected by the deep learning model. Interpretability analysis highlights the following features, to which the model assigns higher attention values (see Results, Fig. 2, Fig. 3, Fig. 5).
H&E slides: Mucin-containing goblet cells are visible with a distinct cellular morphology in the high-attention tiles.
TFF3 slides: Goblet cells show positive staining of the brown histochemical stain in the high-attention tiles.
-Eight NVIDIA V100 GPUs for training, one NVIDIA V100 GPU for inference.
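The MIL idea described above, where a slide-level label is driven by a few informative tiles, can be illustrated with a minimal attention-pooling sketch. This is not the BE-TransMIL architecture (which builds on the Transformer-MIL of [8]); it is a simplified attention-weighted average, and the names `attention_mil_pool` and `attn_vector` are hypothetical, introduced only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_mil_pool(
    tile_features: Sequence[Sequence[float]],
    attn_vector: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Pool per-tile features into one slide-level feature vector.

    Assumption: each tile is scored by a dot product with a (hypothetical)
    learned vector; a real model would use a trained network instead.
    """
    # One scalar relevance score per tile.
    scores = [sum(f * w for f, w in zip(feat, attn_vector)) for feat in tile_features]
    # Normalize scores into attention weights that sum to 1 across the slide.
    attn = softmax(scores)
    dim = len(tile_features[0])
    # Slide representation is the attention-weighted average of tile features.
    slide = [sum(a * feat[d] for a, feat in zip(attn, tile_features)) for d in range(dim)]
    return slide, attn


# Toy bag: two uninformative tiles and one tile with a strong signal.
tiles = [[0.0, 0.0], [0.0, 0.0], [5.0, 5.0]]
slide_vec, attn = attention_mil_pool(tiles, attn_vector=[1.0, 1.0])
```

In this toy bag, nearly all attention falls on the third tile, so the slide-level vector is dominated by it. This mirrors how a small goblet-cell region can determine a BE-positive slide label.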

Interpretability analysis
We analyze the attentions of the model qualitatively and quantitatively, correlate the findings with the TFF3 stain, for which goblet cells show positive staining, and analyze failure modes to ensure model outputs are interpretable (see Results). The analysis includes:
1) Visual assessment of slide attention heatmaps and top/bottom attention tiles to analyze the visual features selected by the models to make a decision.
2) GradCAM saliency maps to analyze fine-grained attention in tiles.
3) Stain-attention correspondence analysis to correlate the attentions with the TFF3 stain ratio.
4) Failure modes analysis to assess the shared and individual mistakes of the H&E and TFF3 models.
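As a rough illustration of items 1) and 3), the sketch below selects the highest-attention tiles and correlates per-tile attention with a per-tile stain ratio. The statistic used in the study is not specified in this section, so Pearson correlation is an assumption here, and the helper names and toy values are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def top_k_tiles(attentions: Sequence[float], k: int) -> List[int]:
    """Indices of the k tiles with the highest attention, highest first."""
    order = sorted(range(len(attentions)), key=lambda i: attentions[i], reverse=True)
    return order[:k]


# Toy values (not study data): attention and stain ratio for four tiles.
attention = [0.05, 0.10, 0.60, 0.25]
stain_ratio = [0.00, 0.02, 0.40, 0.15]
r = pearson(attention, stain_ratio)
top2 = top_k_tiles(attention, k=2)
```

A coefficient close to 1 on real data would indicate that tiles receiving high attention also tend to contain more TFF3-positive staining.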

Supplementary Fig. 1: Qualitative analysis of hematoxylin and eosin (H&E) BE-TransMIL model's capability to generalize to out-of-distribution data. a, Goblet cells can be seen in the high-attention tiles of the true positive slide; low-attention tiles do not show any goblet cells. b, Slide attention heatmap of the true negative slide shows nearly uniform attention; high- and low-attention tiles do not have any goblet cells.
Supplementary Fig. 3: Grad-CAM saliency maps of the top 10 tiles (with highest attention values) of a BE-positive slide (ResNet50, layer 4). a, H&E BE-TransMIL model. b, TFF3 BE-TransMIL model.
Supplementary Fig. 4: Registration of adjacent TFF3 and H&E slides.
of cases from the DELTA implementation study. For each fold in four-fold cross-validation, the discovery training set consists of 912 slides and the validation set of 229 slides for model selection and comparison. The discovery test set consists of 229 slides for model evaluation.
External model validation: The external dataset comprises slide images from the BEST2 case-control clinical trial. The external validation set consists of 725 slides.
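For readers unfamiliar with the procedure, a generic four-fold cross-validation split can be sketched as follows. This is a minimal illustration on toy identifiers, not the study's splitting code; the function name `k_fold_splits` and the seed are hypothetical. A clinical study would typically also split by patient and stratify by label, which this sketch does not do.

```python
import random
from typing import List, Sequence, Tuple


def k_fold_splits(
    slide_ids: Sequence[int], k: int = 4, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, validation) splits; each item is validated exactly once."""
    ids = list(slide_ids)
    # Shuffle with a fixed seed so that the splits are reproducible.
    random.Random(seed).shuffle(ids)
    # Deal the shuffled ids into k folds of near-equal size.
    folds = [ids[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        val = folds[i]
        # Training data is everything outside the held-out fold.
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        splits.append((train, val))
    return splits


# Toy example with 12 slide identifiers.
splits = k_fold_splits(range(12), k=4)
```

Each split holds out a different quarter of the data for validation, so every slide contributes to model selection exactly once.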

Table 5: Failure quantities computed for H&E and TFF3 BE-TransMIL models. Percentages in parentheses are with respect to the total number of errors for each failure type.

Table 7: Patient demographics for discovery and external evaluation datasets.

Table 8: Study setup